Description

This invention relates to an adaptive filtering method for enhancing digitally processed speech or audio signals, e.g.
from a real-time coder for compression of digitally encoded speech or audio signals for transmission or storage, or
more particularly from a real-time vector adaptive predictive coding system.

In the past few years, most research in speech coding has focused on bit rates from 16 kb/s down to 150 bits/s.
At the high end of this range, it is generally accepted that toll quality can be achieved at 16 kb/s by sophisticated
waveform coders which are based on scalar quantization. N.S. Jayant and P. Noll, Digital Coding of Waveforms,
Prentice-Hall Inc., Englewood Cliffs, N.J., 1984. At the other end, coders (such as linear-predictive coders) operating at
2400 bits/s or below only give synthetic-quality speech. For bit rates between these two extremes, particularly between
4.8 kb/s and 9.6 kb/s, neither type of coder can achieve high-quality speech. Part of the reason is that scalar quantization
tends to break down at a bit rate of 1 bit/sample. Vector quantization (VQ), through its theoretical optimality and its
capability of operating at a fraction of one bit per sample, offers the potential of achieving high-quality speech at 9.6
kb/s or even at 4.8 kb/s. J. Makhoul, S. Roucos, and H. Gish, "Vector Quantization in Speech Coding," Proc. IEEE,
Vol. 73, No. 11, November 1985.

Vector quantization (VQ) can achieve a performance arbitrarily close to the ultimate rate-distortion bound if the
vector dimension is large enough. T. Berger, Rate Distortion Theory, Prentice-Hall Inc., Englewood Cliffs, N.J., 1971.
However, only small vector dimensions can be used in practical systems due to complexity considerations, and
unfortunately, direct waveform VQ using small dimensions does not give adequate performance. One possible way to
improve the performance is to combine VQ with other data compression techniques which have been used successfully
in scalar coding schemes.

In speech coding below 16 kb/s, one of the most successful scalar coding schemes is Adaptive Predictive Coding
(APC) developed by Atal and Schroeder [B.S. Atal and M.R. Schroeder, "Adaptive Predictive Coding of Speech Signals,"
Bell Syst. Tech. J., Vol. 49, pp. 1973-1986, October 1970; B.S. Atal and M.R. Schroeder, "Predictive Coding of Speech
Signals and Subjective Error Criteria," IEEE Trans. Acoust., Speech, Signal Proc., Vol. ASSP-27, No. 3, June 1979;
and B.S. Atal, "Predictive Coding of Speech at Low Bit Rates," IEEE Trans. Comm., Vol. COM-30, No. 4, April 1982].
It is the combined power of VQ and APC that led to the development of the present invention, a Vector Adaptive
Predictive Coder (VAPC). Such a combination of VQ and APC will provide high-quality speech at bit rates between 4.8
and 9.6 kb/s, thus bridging the gap between scalar coders and VQ coders.

The basic idea of APC is to first remove the redundancy in speech waveforms using adaptive linear predictors,
and then quantize the prediction residual using a scalar quantizer. In VAPC, the scalar quantizer in APC is replaced
by a vector quantizer (VQ). The motivation for using VQ is two-fold. First, although linear dependency between adjacent
speech samples is essentially removed by linear prediction, adjacent prediction residual samples may still have
nonlinear dependency which can be exploited by VQ. Secondly, VQ can operate at rates below one bit per sample. This
is not achievable by scalar quantization, but it is essential for speech coding at low bit rates.
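
By way of a worked illustration of the second point, the following minimal sketch computes the rate spent on VQ indices, using the example figures given later in this description (K = 20 samples per vector, M = 128 codebook vectors). The 8 kHz sampling rate and the function name are assumptions made only for this illustration, and the side information is not counted.

```python
import math


def vq_bits_per_sample(codebook_size: int, vector_dim: int) -> float:
    """Bits spent on the VQ index per speech sample."""
    return math.log2(codebook_size) / vector_dim


K = 20                 # samples per vector (example value from the description)
M = 128                # vectors in the VQ codebook (example value from the description)
SAMPLE_RATE_HZ = 8000  # assumption: telephone-band sampling, not stated in this section

rate = vq_bits_per_sample(M, K)
index_bit_rate = rate * SAMPLE_RATE_HZ
print(f"{rate:.2f} bit/sample, about {index_bit_rate:.0f} bit/s for VQ indices")
```

Under these assumptions one 7-bit index covers 20 samples, so the index rate is well below one bit per sample, which a scalar quantizer cannot reach.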

The vector adaptive predictive coder (VAPC) has evolved from APC and the vector predictive coder introduced by
V. Cuperman and A. Gersho, "Vector Predictive Coding of Speech at 16 kb/s," IEEE Trans. Comm., Vol. COM-33, pp.
665-696, July 1985. VAPC contains some features that are somewhat similar to the Code-Excited Linear Prediction
(CELP) coder by M.R. Schroeder, B.S. Atal, "Code-Excited Linear Prediction (CELP): High-Quality Speech at Very
Low Bit Rates," Proc. Intl. Conf. Acoustics, Speech, Signal Proc., Tampa, March 1985, but with much less computational
complexity.

In computer simulations, VAPC gives very good speech quality at 9.6 kb/s, achieving 18 dB of signal-to-noise ratio
(SNR) and 16 dB of segmental SNR. At 4.8 kb/s, VAPC also achieves reasonably good speech quality, and the SNR
and segmental SNR are about 13 dB and 11.5 dB, respectively. The computations required to achieve these results
are only on the order of 2 to 4 million flops per second (one flop, a floating point operation, is defined as one multiplication,
one addition, plus the associated indexing), well within the capability of today's advanced digital signal processor chips.
VAPC may become a low-complexity alternative to CELP, which is known to have achieved excellent speech quality
at an expected bit rate around 4.8 kb/s but is not presently capable of being implemented in real time due to its
astronomical complexity: CELP requires over 400 million flops per second to implement the coder. In terms of the CPU time of
a CRAY-1 supercomputer, CELP requires 125 seconds of CPU time to encode one second of speech. There is currently
a great need for a real-time, high-quality speech coder operating at encoding rates ranging from 4.8 to 9.6 kb/s. In this
range of encoding rates, the two coders mentioned above (APC and CELP) are either unable to achieve high quality
or too complex to implement. In contrast, the system of EP-A-0294020, from which the present application is divided,
combines Vector Quantization (VQ) with the advantages of both APC and CELP, and is able to achieve high-quality speech
with sufficiently low complexity for real-time coding.

The noise-masking effect of human auditory perception is exploited in many speech coders by using noise spectral
shaping. However, in noise spectral shaping, lowering noise components at certain frequencies can only be achieved
at the price of increased noise components at other frequencies. Therefore, at bit rates where the average noise level
is quite high, it is very difficult, if not impossible, to force noise below the masking threshold at all frequencies. Since
speech formants are much more important to perception than spectral valleys, the goal is to preserve the formant
information by keeping the noise in the formant regions as low as is practical during encoding. Of course, in this case,
the noise components in spectral valleys may exceed the threshold; however, these noise components can be
attenuated later by a postfilter. In performing such postfiltering, the speech components in spectral valleys will also be
attenuated. Fortunately, the limen, or just noticeable difference, for the intensity of spectral valleys can be quite large.
Therefore, by attenuating the components in spectral valleys, the postfilter only introduces minimal distortion in the
speech signal, but it achieves a substantial noise reduction.

Adaptive postfiltering has been used successfully in enhancing ADPCM-coded speech. Such a postfilter reduces
the overall noise level; however, sufficient noise reduction can only be achieved with severe muffling in the filtered
speech. This is due to the fact that the frequency response of this postfilter generally has a lowpass spectral tilt for
voiced speech.

At a conference in Tokyo sponsored by the IEEE Acoustics, Speech and Signal Processing Society, the Institute
of Electronics and Communications Engineers of Japan and the Acoustical Society of Japan, a variable rate APC
coding system with Maximum Likelihood Quantization (MLQ) was presented [Y. Yatsuzuka et al., "A Variable Rate
Coding by APC with Maximum Likelihood Quantization From 4.8 KBit/s to 16 KBit/s," Proc. ICASSP '86, Vol. 4, pp.
3071-74, April 6, 1986] with adaptive noise-shaping filters in both the encoder and the decoder. An all-pole filter is
inserted around an adaptive quantizer in the coder and another in the decoder after decoding, both filters consisting of
long- and short-term predictors.

An object of this invention is to provide adaptive postfiltering of a speech or audio signal that has been corrupted
by noise resulting from a coding system or other sources of degradation so as to enhance the perceived quality of said
speech or audio signal.

According to the invention there is provided an adaptive filtering method for enhancing digitally processed speech
or audio signals at a receiver by filtering said digitally processed signals with short-delay filtering, said short-delay
filtering being controlled by predetermined linear-predictive coefficient (LPC) parameters; characterised in that said
short-delay filtering uses a pole-zero transfer function consisting of the ratio of two all-pole transfer functions, with the
zeros of said pole-zero transfer function having smaller radii than corresponding poles.
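
The pole-zero short-delay filtering characterised above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the LPC predictor is taken as P(z) = a_1 z^-1 + ... + a_p z^-p, so that P(z/γ) has coefficients a_i γ^i; the values α = 0.8 and β = 0.5 are those quoted for FIG. 6; and the function names and the single example coefficient are hypothetical.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of P(z/gamma): the i-th coefficient (i = 1..p) is scaled by gamma**i."""
    return [c * gamma ** (i + 1) for i, c in enumerate(a)]


def pole_zero_postfilter(x, a, alpha=0.8, beta=0.5):
    """Filter x with H(z) = [1 - P(z/beta)] / [1 - P(z/alpha)], zero initial memory.

    With beta < alpha, every zero has the same angle as a pole but a smaller radius.
    """
    zeros = bandwidth_expand(a, beta)    # numerator coefficients
    poles = bandwidth_expand(a, alpha)   # denominator coefficients
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc -= zeros[i - 1] * x[n - i]   # feed-forward (zero) part
                acc += poles[i - 1] * y[n - i]   # feedback (pole) part
        y.append(acc)
    return y


# Hypothetical first-order predictor with its pole at radius 0.9.
impulse = [1.0, 0.0, 0.0, 0.0]
response = pole_zero_postfilter(impulse, [0.9])
print(response)
```

For this example the postfilter pole lies at radius 0.9 x 0.8 = 0.72 and the zero at 0.9 x 0.5 = 0.45, so the zero partly cancels the pole while leaving a milder emphasis of the same spectral region.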

The preferred embodiment provides postfiltering for use with a system which approximates each vector of K speech
samples by using each of M fixed vectors stored in a VQ codebook to excite a time-varying synthesis filter and picking
the best synthesized vector that minimizes a perceptually meaningful distortion measure. The original sampled speech
is first buffered and partitioned into vectors and frames of vectors, where each frame is partitioned into N vectors, each
vector having K speech samples. Predictive analysis of pitch-filtering parameters (P), linear-predictive coefficient
filtering parameters (LPC), perceptual weighting filter parameters (W) and residual gain scaling factor (G) for each of
successive frames of speech is then performed. The parameters determined in the analyses are quantized and reset every
frame for processing each input vector s_n in the frame, except the perceptual weighting parameter. A perceptual
weighting filter responsive to the parameters W is used to help select the VQ vector that minimizes the perceptual distortion
between the coded speech and the original speech. Although not quantized, the perceptual weighting filter parameters
are also reset every frame.

After each frame is buffered and the above analysis is completed at the beginning of each frame, M zero-state
response vectors are computed and stored in a zero-state response codebook. These M zero-state response vectors
are obtained by setting to zero the memory of an LPC synthesis filter and a perceptual weighting filter in cascade after
a scaling unit controlled by the factor G, controlling the respective filters with the quantized LPC filter parameters
and the unquantized perceptual weighting filter parameters, and exciting the cascaded filters using one predetermined
and fixed codebook vector at a time. The output vector of the cascaded filters for each VQ codebook vector is then
stored in the corresponding address, i.e., is assigned the same index in a temporary zero-state response codebook
as in the VQ codebook. In encoding each input speech vector s_n within a frame, a pitch prediction ŝ_n of the vector is
determined by processing the last vector encoded as an index code through a scaling unit, LPC synthesis filter and
pitch predictor filter controlled by the parameters QG, QLPC, QP and QPP for the frame. In addition, the zero-input
response of the cascaded filters (the ringing from excitation of a previous vector) is first set in a filter. Once the
pitch-predicted vector ŝ_n is subtracted from the input signal vector s_n, and the difference vector is passed through the
perceptual weighting filter to produce a filtered difference vector f_n, the zero-input response vector in the aforesaid
filter is subtracted from the perceptual weight filtered difference vector f_n, and the resulting vector v_n is compared with
each of the M stored zero-state response vectors in search of the one having a minimum difference Δ or distortion.

The index (address) of the zero-state response vector that produces the smallest distortion, i.e., that is closest to
v_n, identifies the best vector in the permanent codebook. Its index (address) is transmitted as the compressed code
for the vector, and used by a receiver which has an identical VQ codebook as the transmitter to find the best-match
vector. In the transmitter, that best-match vector is used at the time of transmission of its index to excite the LPC
synthesis filter and pitch prediction filter to generate an estimate ŝ_n of the next speech vector. The best-match vector
is also used to excite the zero-input response filter to set it for the next speech vector s_n, as described above. The
indices of the best-match vectors for a frame of vectors are combined in a multiplexer with the frame analysis information,
hereinafter referred to as "side information," comprised of the indices of parameters which control pitch, pitch predictor
and LPC predictor filtering and the gain used in the coding process, in order that it may be used by the receiver in
decoding the vector indices of a frame into vectors using a codebook identical to the permanent codebook at the
transmitter. This side information is preferably transmitted through the multiplexer first, once for each frame of VQ
indices that follow, but it would be possible to first transmit a frame of vector indices, and then transmit the side
information, since the frames of vector indices will require some buffering in either case; the difference is only in some initial
delay at the beginning of speech or audio frames transmitted in succession. The resulting stream of multiplexed indices
is transmitted over a communication channel to a decoder, or stored for later decoding.

In the decoder, the bit stream is first demultiplexed to separate the side information from the indices that follow.
Each index is used at the receiver to extract the corresponding vector from the duplicate codebook. The extracted
vector is first scaled by the gain parameter, using a table to convert the gain index to the appropriate scaling factor,
and then used to excite cascaded LPC synthesis and pitch synthesis filters controlled by the same side information
used in selecting the best-match index utilizing the zero-state response codebook in the transmitter. The output of the
pitch synthesis filter is the coded speech, which is perceptually close to the original speech. All of the side information,
except the gain information, is used in an adaptive postfilter to enhance the quality of the speech synthesized. This
postfiltering technique may be used to enhance any voice or audio signal. All that would be required is an analysis
section to produce the parameters used to make the postfilter adaptive.
Although reference is made hereinafter only to speech, the invention described and claimed is applicable to audio
waveforms or to sub-band filtered speech or audio waveforms.

The present application is a divisional application from European Patent Application 88303038.9 filed on 6 April
1988.

An example of the invention will now be described with reference to the accompanying drawings, in which:

FIG. 1a is a block diagram of a Vector Adaptive Predictive Coding (VAPC) system, and FIG. 1b is a block diagram
of a receiver for the encoded speech transmitted by the system of FIG. 1a.

FIG. 2 is a schematic diagram that illustrates the adaptive computation of vectors for a zero-state response
codebook in the system of FIG. 1a.

FIG. 3 is a block diagram of an analysis processor in the system of FIG. 1a.

FIG. 4 is a block diagram of an adaptive postfilter according to the present invention, which may be used in the
receiver of FIG. 1b.

FIG. 5 illustrates the LPC spectrum and the corresponding frequency response of an all-pole postfilter
1/[1-P(z/α)] for different values of α. The offset between adjacent plots is 20 dB.

FIG. 6 illustrates the frequency responses of the postfilter [1-μz^-1][1-P(z/β)]/[1-P(z/α)] corresponding to the LPC
spectrum shown in FIG. 5. In both plots, α=0.8 and β=0.5. The offset between the two plots is 20 dB.

Referring to FIG. 1a, original speech samples s_n in digital form from sampling analog-to-digital converter 10 are
received by an analysis processor 11 which partitions them into vectors s_n of K samples per vector, and into frames
of N vectors per frame. The analysis processor stores the samples in a dual buffer memory which has the capacity for
storing more than one frame of vectors, for example two frames of 8 vectors per frame, each vector consisting of 20
samples, so that the analysis processor may compute parameters used for coding the following frame. As each frame
is being processed out of one buffer, a new frame coming in is stored in the other buffer so that when processing of a
frame has been completed, there is a new frame buffered and ready to be processed.
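
The partitioning described above can be sketched as follows, using the example figures K = 20 and N = 8. This is a simplified illustration only: the function name is hypothetical, the dual-buffer timing is not modelled, and any trailing samples that do not yet fill a frame are simply left out, as they would still be waiting in the input buffer.

```python
def partition_into_frames(samples, K=20, N=8):
    """Split a sample sequence into frames of N vectors, each vector holding K samples."""
    frame_len = K * N
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        vectors = [frame[i:i + K] for i in range(0, frame_len, K)]
        frames.append(vectors)
    return frames


# 400 samples give two complete frames of 160 samples; the last 80 remain buffered.
frames = partition_into_frames(list(range(400)))
print(len(frames), len(frames[0]), len(frames[0][0]))
```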

The analysis processor determines the parameters of filters employed in the Vector Adaptive Predictive Coding
technique.

These parameters are transmitted through a multiplexer 12 as side information just ahead of the frame of vector
codes generated with the use of a vector quantized (VQ) permanent codebook 13 and a zero-state response (ZSR)
codebook 14. The side information conditions the receiver to properly filter decoded vectors of the frame. The analysis
processor 11 also computes other parameters used in the encoding process. The latter are represented in FIG. 1a by
dashed lines, and consist of sets of parameters which are designated W for a perceptual weighting filter 18, a quantized
LPC predictor QLPC for an LPC synthesis filter 15, and quantized pitch QP and pitch predictor QPP for a pitch synthesis
filter 16. Also computed by the analysis processor is a scaling factor G for control of a scaling unit 17. The four quantized
parameters transmitted as side information are encoded using a quantizing table as the quantized pitch index, pitch
predictor index, LPC predictor index and gain index. The manner in which the analysis processor computes all of these
parameters will be described with reference to FIG. 3.

The multiplexer 12 preferably transmits the side information as soon as it is available, although it could follow the
frame of encoded input vectors, and while that is being done, M zero-state response vectors are computed for the
zero-state response (ZSR) codebook 14 in a manner illustrated in FIG. 2, which is to process each vector in the VQ
codebook 13, e.g., 128 vectors, through a gain scaling unit 17', an LPC synthesis filter 15', and perceptual weighting
filter 18' corresponding to the gain scaling unit 17, the LPC synthesis filter 15, and perceptual weighting filter 18 in
the transmitter (FIG. 1a). Ganged commutating switches S1 and S2 are shown to signify that each fixed VQ vector
processed is stored in memory locations of the same index (address) in the ZSR codebook.
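
The per-frame construction of the ZSR codebook may be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of FIG. 2: the LPC synthesis filter is taken as 1/[1-P(z)] with P(z) = a_1 z^-1 + ... + a_p z^-p, the perceptual weighting filter is omitted because its transfer function is not given in this part of the description, and all names and numeric values are hypothetical.

```python
def synthesis_filter(x, a, state=None):
    """All-pole filter 1/[1 - P(z)].

    state holds the last len(a) outputs, most recent first; None means zero memory.
    Returns (output, final_state).
    """
    hist = list(state) if state is not None else [0.0] * len(a)
    y = []
    for sample in x:
        out = sample + sum(c * h for c, h in zip(a, hist))
        y.append(out)
        hist = [out] + hist[:-1]
    return y, hist


def build_zsr_codebook(vq_codebook, gain, lpc):
    """Zero-state response of every VQ vector; same index as in the VQ codebook."""
    zsr_codebook = []
    for vector in vq_codebook:
        scaled = [gain * v for v in vector]          # gain scaling unit
        response, _ = synthesis_filter(scaled, lpc)  # filter memory reset to zero
        zsr_codebook.append(response)
    return zsr_codebook


# Tiny hypothetical codebook (M = 2, K = 3) with a first-order predictor.
vq_codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
zsr_codebook = build_zsr_codebook(vq_codebook, gain=2.0, lpc=[0.5])
print(zsr_codebook)
```

Because the filter memory is reset for every codebook vector, the stored responses depend only on the frame parameters and can be reused for every input vector of the frame.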

At the beginning of each vector processing, the initial conditions of the cascaded filters 15' and 18' are set to zero.
This simulates what the cascaded filters 15' and 18' will do with no previous vector present from the corresponding VQ
codebook. Thus, if the output of a zero-input response filter 19 in the transmitter (FIG. 1a) is held or stored
at each step of computing the VQ code index (to transmit for each vector of a frame), it is possible to simplify
encoding the speech vectors by subtracting the zero-input response output from the vector f_n. In other words, assuming
M=128, there are 128 different vectors permanently stored in the VQ codebook to use in coding the original speech
vectors s_n. Every one of the 128 VQ vectors is read out in sequence and fed through the scaling unit 17', the LPC
synthesis filter 15', and the perceptual weighting filter 18' without any history of previous vector inputs, by resetting
those filters at each step. The resulting filter output vector is then stored in a corresponding location in the zero-state
response codebook. Later, while encoding input signal vectors s_n by finding the best match between a vector v_n and
all of the zero-state response vector codes, it is necessary to subtract from a vector f_n derived from the perceptual
weighting filter a value that corresponds to the effect of the previously selected VQ vector. That is done through the
zero-input response filter 19. The index (address) of the best match is used as the compressed vector code transmitted
for the vector s_n. Of the 128 zero-state response vectors, there will be only one that produces the best match, i.e., least
distortion. Assume it is in location 38 of the zero-state response codebook as determined by a computer 20 labeled
"compute norm." An address register 20a will store the index 38. It is that index that is then transmitted as a VQ index
to the receiver shown in FIG. 1b.

In the receiver, a demultiplexer 21 separates the side information, which conditions the receiver with the same
parameters as corresponding filters and scaling unit of the transmitter. The receiver uses a decoder 22 to translate the
parameter indices to parameter values. The VQ index for each successive vector in the frame addresses a VQ
codebook 23 which is identical to the fixed VQ codebook 13 of the transmitter. The LPC synthesis filter 24, pitch synthesis
filter 25, and scaling unit 26 are conditioned by the same parameters which were used in computing the zero-state
codebook values, and which were in turn used in the process of selecting the encoding index for each input vector. At
each step of finding and transmitting an encoding index, the zero-input response filter 19 computes from the VQ vector
at the location of the index transmitted a value to be subtracted from the input vector f_n to present a zero-input response
to be used in the best-match search.

There are various procedures that may be used to determine the best match for an input vector s_n. The simplest
is to store the resulting distortion between each zero-state response vector code output and the vector v_n together with
the index of that vector code. Assuming there are 128 vector codes stored in the codebook 14, there would then be 128
resulting distortions stored in a best address computer 20. Then, after all have been stored, a search is made in the
computer 20 for the lowest distortion value. Its index is then transmitted to the receiver as an encoded vector via the
multiplexer 12, and to the VQ codebook for reading the corresponding VQ vector to be used in the processing of the
next input vector s_n.

In summary, it should be noted that the VQ codebook is used (accessed) in two different steps: first, to compute
vector codes for the zero-state response codebook at the beginning of each frame, using the LPC synthesis and
perceptual weighting filter parameters determined for the frame; and second, to excite the filters 15 and 16 through
the scaling unit 17 while searching for the index of the best-match vector, during which the estimate ŝ_n thus produced
is subtracted from the input vector s_n. The difference d_n is used in the best-match search.
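
The simplest search procedure described above (store one distortion per ZSR vector, then pick the lowest) can be sketched as follows. The squared Euclidean distance is an assumption standing in for the distortion computed by the "compute norm" block, and the function names and the small codebook are hypothetical.

```python
def squared_error(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_match(v_n, zsr_codebook):
    """Return (index, distortion) of the ZSR vector closest to the target v_n."""
    distortions = [squared_error(v_n, z) for z in zsr_codebook]  # one per codebook entry
    index = min(range(len(distortions)), key=distortions.__getitem__)
    return index, distortions[index]


# Hypothetical ZSR codebook with three entries of dimension 2.
zsr_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
index, distortion = best_match([0.9, 1.2], zsr_codebook)
print(index, distortion)
```

Only the returned index would be transmitted; the same index then addresses the VQ codebook to fetch the excitation vector for the next step.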

As the best match for each input vector s_n is found, the corresponding predetermined and fixed vector from the
VQ codebook is used to reset the zero-input response filter 19 for the next vector of the frame. The function of the
zero-input response filter 19 is thus to find the residual response of the gain scaling unit 17' and filters 15' and 18' to
previously selected vectors from the VQ codebook. Thus, the selected vector is not transmitted; only its index is
transmitted. At the receiver its index is used to read out the selected vector from a VQ codebook 23 identical to the VQ
codebook 13 in the transmitter.

The zero-input response filter 19 is the same filtering operation that is used to generate the ZSR codebook, namely
the combination of a gain G, an LPC synthesis filter and a weighting filter, as shown in FIG. 2. Once a best codebook
vector match is determined, the best-match vector is applied as an input to this filter (sample by sample,
sequentially). An input switch s_i is closed and an output switch s_o is open during this time so that the first K output samples
are ignored. (K is the dimension of the vector and a typical value is 20.) As soon as all K samples have been applied
as input to the filter, the filter input switch s_i is opened and the output switch s_o is closed. The next K samples of the
vector f_n, the output of the perceptual weighting filter, begin to arrive, and the next K output samples of the zero-input
response filter are subtracted from them. The difference so generated is a set of K samples forming the vector v_n,
which is stored in a static register for use in the ZSR codebook search procedure. In the ZSR codebook search
procedure, the vector v_n is subtracted from each vector stored in the ZSR codebook, and the difference vector Δ is
fed to the computer 20 together with the index (or stored in the same order, thereby to imply the index of the vector
out of the ZSR codebook). The computer 20 then determines which difference is the smallest, i.e., which is the best
match between the vector v_n and each vector stored temporarily (for one frame of input vectors s_n). The index of that
best-match vector is stored in a register 20a. That index is transmitted as a vector code and used to address the VQ
codebook to read the vector stored there into the scaling unit 17, as noted above. This search process is repeated for
each vector in the ZSR codebook, each time using the same vector v_n. Then the best vector is determined.
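
The reason the zero-input response can be removed once per vector, leaving only zero-state responses to be searched, is the linearity of the filters: the output for the current vector equals the ringing left by earlier vectors plus the response of the reset filter to the current vector. The sketch below checks this numerically. It assumes an LPC synthesis filter 1/[1-P(z)] with hypothetical coefficients and omits the perceptual weighting filter; all names and values are illustrative only.

```python
def synthesis_filter(x, a, state=None):
    """All-pole filter 1/[1 - P(z)]; state = last len(a) outputs, most recent first."""
    hist = list(state) if state is not None else [0.0] * len(a)
    y = []
    for sample in x:
        out = sample + sum(c * h for c, h in zip(a, hist))
        y.append(out)
        hist = [out] + hist[:-1]
    return y, hist


lpc = [0.9, -0.2]                  # hypothetical predictor (poles at 0.5 and 0.4)
gain = 1.5                         # hypothetical gain G
previous = [1.0, -0.5, 0.25, 0.0]  # previously selected VQ vector
current = [0.3, 0.0, -0.7, 0.2]    # candidate VQ vector for the present step
K = len(current)

# Input switch closed: the previous vector is fed in and its K outputs are ignored.
_, memory = synthesis_filter([gain * s for s in previous], lpc)

# Input switch open: the next K outputs are the ringing (zero-input response).
zir, _ = synthesis_filter([0.0] * K, lpc, memory)

# Zero-state response of the candidate (filter memory reset to zero).
zsr, _ = synthesis_filter([gain * s for s in current], lpc)

# Response of a filter that keeps its memory from the previous vector.
full, _ = synthesis_filter([gain * s for s in current], lpc, memory)

print(max(abs(f - (r + s)) for f, r, s in zip(full, zir, zsr)))
```

The printed residual is at the level of floating-point rounding, which is why subtracting the zero-input response from f_n once allows v_n to be compared directly with the stored zero-state responses.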

Referring now to FIG. 1b, it should be noted that the output of the VQ codebook 23, which precisely duplicates
the VQ codebook 13 of the transmitter, is identical to the vector extracted from the best-match index applied as an
address to the VQ codebook 13; the gain unit 26 is identical to the gain unit 17 in the transmitter, and filters 24 and 25
exactly duplicate the filters 15 and 16, respectively, except that at the receiver, the approximation s̃_n rather than the
prediction ŝ_n is taken as the output of the pitch synthesis filter 25. The result, after converting from digital to analog
form, is synthesized speech that reproduces the original speech with very good quality.
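
A much simplified sketch of the receiver path for one frame follows: each index addresses the duplicate codebook, the vector is scaled by the decoded gain, and the result excites the LPC synthesis filter, whose memory carries over from vector to vector. The pitch synthesis filter and the postfilter are left out because their transfer functions are not given in this part of the description, the filter memory is assumed to start at zero, and every name and value is hypothetical.

```python
def decode_frame(indices, vq_codebook, gain_table, gain_index, lpc):
    """Decode one frame of VQ indices into synthesized samples (LPC synthesis only)."""
    gain = gain_table[gain_index]      # table lookup: gain index -> scaling factor
    history = [0.0] * len(lpc)         # past outputs, most recent first
    speech = []
    for index in indices:
        for sample in vq_codebook[index]:   # vector read from the duplicate codebook
            out = gain * sample + sum(c * h for c, h in zip(lpc, history))
            speech.append(out)
            history = [out] + history[:-1]  # memory is kept across vectors
    return speech


# Hypothetical two-entry codebook, two-entry gain table, first-order predictor.
decoded = decode_frame(
    indices=[0, 1],
    vq_codebook=[[1.0, 0.0], [0.0, 1.0]],
    gain_table=[1.0, 2.0],
    gain_index=1,
    lpc=[0.5],
)
print(decoded)
```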

It has been found that by applying an adaptive postfilter 30 to the synthesized speech before converting it from
digital to analog form, the perceived coding noise may be greatly reduced without introducing significant distortion in
the filtered speech. FIG. 4 illustrates the preferred organization of the adaptive postfilter as a long-delay filter 31 and
a short-delay filter 32. Both filters are adaptive in that the parameters used in them are those received as side information
from the transmitter, except for the gain parameter, G. The basic idea of adaptive postfiltering is to attenuate the
frequency components of the coded speech in spectral valley regions. At low bit rates, a considerable amount of
perceived coding noise comes from spectral valley regions where there are no strong resonances to mask the noise.
The postfilter attenuates the noise components in spectral valley regions to make the coding noise less perceivable.
However, such a filtering operation inevitably introduces some distortion to the shape of the speech spectrum.
Fortunately, our ears are not very sensitive to distortion in spectral valley regions; therefore, adaptive postfiltering only
introduces very slight distortion in perceived speech, but it significantly reduces the perceived noise level. The adaptive
postfilter will be described in greater detail after first describing in more detail the analysis of a frame of vectors to
determine the side information.

FIG. 3 shows the organization of the initial analysis of block 11 in FIG. 1a. The input speech
samples s_n are first stored in a buffer 40 capable of storing, for example, more than one frame of 8 vectors, each vector
having 20 samples.

Once a frame of input vectors s_n has been stored, the parameters to be used, and their indices to be transmitted
as side information, are determined from that frame and at least a part of the previous frame in order to perform analysis
with information from more than the frame of interest. The analysis is carried out as shown using a pitch detector 41,
pitch quantizer 42 and a pitch predictor coefficient quantizer 43. What is referred to as "pitch" applies to any observed
periodicity in the input signal, which may not necessarily correspond to the classical use of "pitch" corresponding to
vibrations in the human vocal folds. The direct output of the speech is also used in the pitch predictor coefficient
quantizer 43. The quantized pitch (QP) and quantized pitch predictor (QPP) are used to compute a pitch-prediction
residual in block 44, and as control parameters for the pitch synthesis filter 16 used as a predictor in FIG. 1a. Only a
pitch index and a pitch prediction index are included in the side information to minimize the number of bits transmitted.
At the receiver, the decoder 22 will use each index to produce the corresponding control parameters for the pitch synthesis
filter 25.

The pitch-prediction residual is stored in a buffer 45 for LPC analysis in block 46. The LPC predictor from the LPC 
analysis is quantized in block 47. The index of the quantized LPC predictor is transmitted as a third one of four pieces 
of side information, while the quantized LPC predictor is used as a parameter for control of the LPC synthesis filter 15, 
and in block 48 to compute the rms value of the LPC predictive residual. This value (unquantized residual gain) is then 
quantized in block 49 to provide gain control G in the scaling unit 17 of FIG. 1a. The index of the quantized residual 
gain is the fourth part of the side information transmitted. 

In addition to the foregoing, the analysis section provides LPC analysis in block 50 to produce an LPC predictor
from which the set of parameters W for the perceptual weighting filter 18 (FIG. 1a) is computed in block 51.

The adaptive postfilter 30 in FIG. 1b will now be described with reference to FIG. 4. It consists of a long-delay filter
31 and a short-delay filter 32 in cascade. The long-delay filter is derived from the decoded pitch-predictor information
available at the receiver. It attenuates frequency components between pitch harmonic frequencies. The short-delay
filter is derived from LPC predictor information, and it attenuates the frequency components between formant
frequencies.

The noise masking effect of human auditory perception, recognized by M.R. Schroeder, B.S. Atal, and J.L. Hall,
"Optimizing Digital Speech Coders by Exploiting Masking Properties of the Human Ear," J. Acoust. Soc. Am., Vol. 66,
No. 6, pp. 1647-1652, December 1979, is exploited in VAPC by using noise spectral shaping. However, in noise spectral
shaping, lowering noise components at certain frequencies can only be achieved at the price of increased noise
components at other frequencies. [B.S. Atal and M.R. Schroeder, "Predictive Coding of Speech Signals and Subjective
Error Criteria," IEEE Trans. Acoust., Speech, and Signal Processing, Vol. ASSP-27, No. 3, pp. 247-254, June 1979]
Therefore, at bit rates as low as 4800 bps, where the average noise level is quite high, it is very difficult, if not impossible,
to force noise below the masking threshold at all frequencies. Since speech formants are much more important to
perception than spectral valleys, the approach of the present invention is to preserve the formant information by keeping
the noise in the formant regions as low as is practical during encoding. Of course, in this case, the noise components
in spectral valleys may exceed the threshold; however, these noise components can be attenuated later by the postfilter
30. In performing such postfiltering, the speech components in spectral valleys will also be attenuated. Fortunately,
the limen, or "just noticeable difference," for the intensity of spectral valleys can be quite large [J.L. Flanagan, Speech
Analysis, Synthesis, and Perception, Academic Press, New York, 1972]. Therefore, by attenuating the components in
spectral valleys, the postfilter only introduces minimal distortion in the speech signal, but it achieves a substantial noise
reduction.

Adaptive postfiltering has been used successfully in enhancing ADPCM-coded speech. See V. Ramamoorthy and
N.S. Jayant, 'Enhancement of ADPCM Speech by Adaptive Postfiltering,' AT&T Bell Labs Tech. J., pp. 1465-1475,
October 1984; and N.S. Jayant and V. Ramamoorthy, 'Adaptive Postfiltering of 16 kb/s-ADPCM Speech,' Proc. ICASSP,
pp. 829-832, Tokyo, Japan, April 1986. The postfilter used by Ramamoorthy, et al., supra, is derived from the
two-pole six-zero ADPCM synthesis filter by moving the poles and zeros radially toward the origin. If this idea is extended
directly to an all-pole LPC synthesis filter 1/[1-P(z)], the result is 1/[1-P(z/α)] as the corresponding postfilter, where
0<α<1. Such an all-pole postfilter indeed reduces the perceived noise level; however, sufficient noise reduction can
only be achieved with severe muffling in the filtered speech. This is due to the fact that the frequency response of this
all-pole postfilter generally has a lowpass spectral tilt for voiced speech.
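
The radial scaling of the poles can be illustrated with a short sketch. It assumes the predictor has the usual form P(z) = a1 z^-1 + ... + aM z^-M, in which case replacing z by z/α simply multiplies the i-th coefficient by α^i. The function names and the second-order example predictor are hypothetical and serve only as an illustration.

```python
import cmath

def scale_predictor(a, alpha):
    """Coefficients of P(z/alpha), assuming P(z) = sum_i a[i-1] * z**-i."""
    return [coef * alpha ** (i + 1) for i, coef in enumerate(a)]

def synthesis_response(a, omega):
    """Frequency response of 1 / (1 - P(z)) evaluated at z = exp(j*omega)."""
    p = sum(coef * cmath.exp(-1j * omega * (i + 1)) for i, coef in enumerate(a))
    return 1.0 / (1.0 - p)

# Hypothetical predictor whose synthesis filter has poles at 0.95*exp(+/- j*pi/3).
a = [2 * 0.95 * 0.5, -0.95 ** 2]
a_scaled = scale_predictor(a, 0.8)   # poles move radially to radius 0.8 * 0.95

peak_before = abs(synthesis_response(a, cmath.pi / 3))
peak_after = abs(synthesis_response(a_scaled, cmath.pi / 3))
print(a_scaled, peak_before, peak_after)
```

The pole angles are unchanged while the resonance peak is flattened, which is the behaviour attributed above to the all-pole postfilter.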

The spectral tilt of the all-pole postfilter 1/[1-P(z/α)] can be easily reduced by adding zeros having the same phase
angles as the poles but with smaller radii. The transfer function of the resulting pole-zero postfilter 32a has the form

$$ H(z) = \frac{1 - P(z/\beta)}{1 - P(z/\alpha)}, \quad 0 < \beta < \alpha < 1 \tag{1} $$

where α and β are coefficients empirically determined, with some tradeoff between spectral peaks being so sharp as
to produce chirping and being so low as to not achieve any noise reduction. The frequency response of H(z) can be
expressed as

$$ 20\log\left|H(e^{j\omega})\right| = 20\log\left|\frac{1}{1 - P(e^{j\omega}/\alpha)}\right| - 20\log\left|\frac{1}{1 - P(e^{j\omega}/\beta)}\right| \tag{2} $$

Therefore, in logarithmic scale, the frequency response of the pole-zero postfilter H(z) is simply the difference between
the frequency responses of two all-pole postfilters.

Typical values of α and β are 0.8 and 0.5, respectively. From FIG. 5, it is seen that the response for α=0.8 has
both formant peaks and spectral tilt, while the response for α=0.5 has spectral tilt only. Thus, with α=0.8 and β=0.5 in
Equation 2, we can at least partially remove the spectral tilt by subtracting the response for α=0.5 from the response
for α=0.8. The resulting frequency response of H(z) is shown in the upper plot of FIG. 6.
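
The subtraction of the two log responses can be checked numerically. The following is a minimal sketch, assuming the pole-zero postfilter has the form [1-P(z/β)]/[1-P(z/α)] with the typical scaling values 0.8 and 0.5, and P(z) = a1 z^-1 + ... + aM z^-M. The helper names and the predictor coefficients are hypothetical.

```python
import cmath
import math

def inverse_filter(a, factor, omega):
    """Value of 1 - P(z/factor) at z = exp(j*omega)."""
    return 1.0 - sum(c * factor ** k * cmath.exp(-1j * omega * k)
                     for k, c in enumerate(a, start=1))

def all_pole_db(a, factor, omega):
    """Log response 20*log10 |1 / (1 - P(z/factor))| in dB."""
    return -20.0 * math.log10(abs(inverse_filter(a, factor, omega)))

def pole_zero_db(a, alpha, beta, omega):
    """Log response of [1 - P(z/beta)] / [1 - P(z/alpha)] in dB."""
    ratio = inverse_filter(a, beta, omega) / inverse_filter(a, alpha, omega)
    return 20.0 * math.log10(abs(ratio))

a = [0.95, -0.9025]   # hypothetical second-order predictor
max_gap = max(
    abs(pole_zero_db(a, 0.8, 0.5, w)
        - (all_pole_db(a, 0.8, w) - all_pole_db(a, 0.5, w)))
    for w in (0.1, math.pi / 3, 2.0, 3.0)
)
print(max_gap)   # only floating-point rounding error remains
```

Evaluating `pole_zero_db` over a grid of frequencies gives a curve of the kind described for the upper plot of FIG. 6, although the actual figure depends on the predictor of the frame being decoded.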

In informal listening tests, it has been found that the muffling effect was significantly reduced after the numerator
term [1-P(z/β)] was included in the transfer function H(z). However, the filtered speech remained slightly muffled even
with the spectral-tilt compensating term [1-P(z/β)]. To further reduce the muffling effect, a first-order filter 32b was
added which has a transfer function of (1 - μz^-1), where μ is typically 0.5. Such a filter provides a slightly highpassed
spectral tilt and thus helps to reduce muffling. This first-order filter is used in cascade with H(z), and a combined
frequency response with μ=0.5 is shown in the lower plot of FIG. 6.
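
The highpass tilt of the first-order filter can be seen from its difference equation y(n) = x(n) - μ x(n-1). The sketch below is illustrative only; the function name and the test signals are hypothetical.

```python
def first_order_tilt(x, mu=0.5):
    """Apply the filter 1 - mu * z**-1, i.e. y(n) = x(n) - mu * x(n-1)."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - mu * prev)
        prev = sample
    return y

dc = first_order_tilt([1.0] * 8)              # constant input (zero frequency)
nyquist = first_order_tilt([1.0, -1.0] * 4)   # alternating input (half the sampling rate)
print(dc[-1], nyquist[-1])
```

With μ=0.5 the steady-state gain is 1-μ = 0.5 at zero frequency and 1+μ = 1.5 at half the sampling rate, a difference of roughly 9.5 dB in favour of the high frequencies.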

The short-delay postfilter 32 just described basically amplifies speech formants and attenuates inter-formant
valleys. To obtain the ideal postfilter frequency response, we also have to amplify the pitch harmonics and attenuate the
valleys between harmonics. Such a characteristic of frequency response can be achieved with a long-delay postfilter
using the information in the pitch predictor.

In VAPC, we use a three-tap pitch predictor; the pitch synthesis filter corresponding to such a pitch predictor is not
guaranteed to be stable. Since the poles of such a synthesis filter may be outside the unit circle, moving the poles
toward the origin may not have the same effect as in a stable LPC synthesis filter. Even if the three-tap pitch synthesis
filter is stabilized, its frequency response may have an undesirable spectral tilt. Thus, it is not suitable to obtain the
long-delay postfilter by scaling down the three tap weights of the pitch synthesis filter.

With both poles and zeros, the long-delay postfilter can be chosen as

$$ H_l(z) = C_g \frac{1 + \gamma z^{-p}}{1 - \lambda z^{-p}} \tag{3} $$

where p is determined by pitch analysis, and C_g is an adaptive scaling factor.

Knowing the information provided by a single-tap or three-tap pitch predictor as the value b2 or the sum b1+b2+b3,
the factors γ and λ are determined according to the following formulas:

$$ \gamma = C_z f(x), \quad \lambda = C_p f(x), \quad 0 < C_z, C_p < 1 \tag{4} $$

where

$$ f(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } U_{th} \le x \le 1 \\ 0 & \text{if } x < U_{th} \end{cases} \tag{5} $$

where U_th is a threshold value (typically 0.6) determined empirically, and x can be either b2 or b1+b2+b3 depending on
whether a one-tap or a three-tap pitch predictor is used. Since a quantized three-tap pitch predictor is preferred and
therefore already available at the VAPC receiver, x is chosen as

$$ x = \sum_{i=1}^{3} b_i $$

in VAPC postfiltering. On the other hand, if the postfilter is used elsewhere to enhance noisy input speech, a separate
pitch analysis is needed, and x may be chosen as a single value b2 since a one-tap pitch predictor suffices. (The value
b2 when used alone indicates a value from a single-tap predictor, which in practice would be the same as a three-tap
predictor when b1 and b3 are set to zero.)
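
The selection of the two factors and the resulting filtering can be sketched as follows. The sketch assumes the long-delay postfilter has one zero term and one pole term at the pitch lag p, giving the difference equation y(n) = C_g [s(n) + γ s(n-p)] + λ y(n-p). The function names, the values chosen for C_z and C_p, and the treatment of C_g as a plain argument are assumptions made only for illustration.

```python
def f(x, u_th=0.6):
    """Periodicity indicator: 1 above one, 0 below the threshold, x in between."""
    if x > 1.0:
        return 1.0
    if x < u_th:
        return 0.0
    return x

def long_delay_postfilter(s, pitch, taps, c_z, c_p, c_g, u_th=0.6):
    """Filter s(n) with C_g * (1 + gamma*z**-p) / (1 - lambda*z**-p)."""
    x = sum(taps)                 # b2 alone, or b1 + b2 + b3
    gamma = c_z * f(x, u_th)      # weight of the zero (feed-forward) term
    lam = c_p * f(x, u_th)        # weight of the pole (feedback) term
    y = []
    for n, sample in enumerate(s):
        past_in = s[n - pitch] if n >= pitch else 0.0
        past_out = y[n - pitch] if n >= pitch else 0.0
        y.append(c_g * (sample + gamma * past_in) + lam * past_out)
    return y

# Impulse response for a hypothetical voiced frame with pitch lag 2.
y = long_delay_postfilter([1.0, 0.0, 0.0, 0.0, 0.0], pitch=2,
                          taps=(0.3, 0.3, 0.3), c_z=0.25, c_p=0.25, c_g=1.0)
print(y)
```

When the tap sum falls below the threshold, both factors become zero and the filter reduces to a pure gain, so frames with little periodicity pass through without comb filtering.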

The goal is to make the power of {y(n)} about the same as that of {s(n)}. An appropriate scaling factor is chosen as

$$ C_g = \frac{1 - \lambda/x}{1 + \gamma/x} \tag{6} $$


The first-order filter 32b can also be made adaptive to better track the change in the spectral tilt of H(z). However,
it has been found that even a fixed filter with μ=0.5 gives quite satisfactory results. A fixed value of μ may be determined
empirically.

To avoid occasional large gain excursions, an automatic gain control (AGC) was added at the output of the adaptive
postfilter. The purpose of AGC is to scale the enhanced speech such that it has roughly the same power as the unfiltered
noisy speech. It is comprised of a gain (volume) estimator 33 operating on the speech input s(n), a gain (volume)
estimator 34 operating on the postfiltered output r(n), and a circuit 35 to compute a scaling factor as the ratio of the
two gains. The postfiltered output r(n) is then multiplied by this ratio in a multiplier 36. AGC is thus achieved by
estimating the power of the unfiltered and filtered speech separately and then using the ratio of the two values as the
scaling factor. Let {s(n)} be the sequence of either unfiltered or filtered speech samples; then, the speech power σ²(n)
is estimated by using

$$ \sigma^2(n) = \zeta\,\sigma^2(n-1) + (1 - \zeta)\,s^2(n), \quad 0 < \zeta < 1 \tag{7} $$

A suitable value of ζ is 0.99.
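
The gain control loop can be sketched as below. The recursive power estimate follows Equation 7 for both signals. Taking the square root of the power ratio to obtain an amplitude gain, and adding a small constant to avoid division by zero, are assumptions of this sketch rather than details stated in the text. The function and variable names are hypothetical.

```python
import math

def agc(s, r, zeta=0.99, eps=1e-12):
    """Scale the postfiltered samples r(n) so their power tracks that of s(n)."""
    power_s = 0.0   # running power estimate of the unfiltered speech
    power_r = 0.0   # running power estimate of the postfiltered speech
    out = []
    for s_n, r_n in zip(s, r):
        power_s = zeta * power_s + (1.0 - zeta) * s_n * s_n
        power_r = zeta * power_r + (1.0 - zeta) * r_n * r_n
        gain = math.sqrt(power_s / (power_r + eps))   # assumed amplitude ratio
        out.append(gain * r_n)
    return out

# Hypothetical check: a postfilter output that is twice too loud is pulled back.
s = [1.0, -1.0] * 50
r = [2.0 * v for v in s]
out = agc(s, r)
print(out[:4])
```

Because both estimates use the same ζ, a constant level mismatch between s(n) and r(n) is corrected from the first sample, while a changing mismatch is followed with a time constant of roughly 1/(1-ζ) = 100 samples.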

The complexity of the postfilter described in this section is only a small fraction of the overall complexity of the rest
of the VAPC system, or any other coding system that may be used. In simulations, this postfilter achieves significant
noise reduction with almost negligible distortion in speech. To test for possible distorting effects, the adaptive
postfiltering operation was applied to clean, uncoded speech and it was found that the unfiltered original and its filtered
version sound essentially the same, indicating that the distortion introduced by this postfilter is negligible.

It should be noted that although this novel postfiltering technique was developed for use with the VAPC system,
its applications are not restricted to use with it. In fact, this technique can be used not only to enhance the quality of
any noisy digital speech signal but also to enhance the decoded speech of other speech coders when provided with
a buffer and analysis section for determining the parameters.

What has been disclosed in the parent application from which the present application is divided is a real-time
Vector Adaptive Predictive Coder (VAPC) for speech or audio which may be implemented with software using the
commercially available AT&T DSP32 digital processing chip. In its newest version, this chip has a processing power
of 6 million instructions per second (MIPS). To facilitate implementation for real-time speech coding, a simplified version
of the 4800 bps VAPC is available. This simplified version has a much lower complexity, but gives nearly the same
speech quality as a full complexity version.


Claims 

1. An adaptive filtering method for enhancing digitally processed speech or audio signals at a receiver by filtering
said digitally processed signals with short-delay filtering, said short-delay filtering being controlled by
predetermined linear-predictive coefficient (LPC) parameters; characterised in that said short-delay filtering uses a
pole-zero transfer function consisting of the ratio of two all-pole transfer functions, with the zeros of said pole-zero
transfer function having smaller radii than corresponding poles.

2. An adaptive filtering method as defined in claim 1 wherein said short-delay filtering is carried out in cascade with
long-delay filtering controlled by predetermined pitch and pitch predictor parameters.

3. An adaptive filtering method as defined in claim 1 or 2 including first order filtering with a transfer function
1 - μz^-1, μ < 1, in cascade with said short-delay filtering.

4. A method as defined in claim 1 or 2 wherein said parameters are predetermined at a transmitter by analysis of 
digital speech or audio signals before processing for transmittal to said receiver and said parameters are trans- 
mitted as side information to said receiver along with said digitally processed speech or audio signals. 

5. A method as defined in claim 1 or 2 wherein said parameters are predetermined at said receiver by performing
analysis of digitally processed speech or audio signals received.

6. A method as defined in claim 2 wherein said LPC parameters are predetermined at said transmitter by analysis
of digital speech or audio signals before processing and transmitting as side information to said receiver, and said
pitch and pitch predictor parameters are predetermined at said receiver by performing analysis of digitally
processed speech or audio signals received.

7. A method as defined in any of claims 1 to 6 including automatic gain control of said digitally processed signal after
filtering by computing a value σ2(n) proportional to volume of filtered speech or audio signals and a value σ1(n)
proportional to volume of speech or audio signals before filtering and controlling the gain of the filtered speech or
audio signals by a ratio of σ1(n) to σ2(n).

8. A method as claimed in claim 2 wherein postfiltering is accomplished by using a transfer function for said
long-delay postfilter of the form

$$ C_g \frac{1 + \gamma z^{-p}}{1 - \lambda z^{-p}} $$

where C_g is an adaptive scaling factor, and the factors γ and λ are determined according to the following formulas

$$ \gamma = C_z f(x), \quad \lambda = C_p f(x), \quad 0 < C_z, C_p < 1 $$

where

$$ f(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } U_{th} \le x \le 1 \\ 0 & \text{if } x < U_{th} \end{cases} $$

U_th is a threshold value and x can be either b2 or b1+b2+b3 depending upon whether a one-tap or three-tap pitch
predictor is used.



Claims (French version)

1. An adaptive filtering method for enhancing digitally processed speech or audio signals, at a receiver, by filtering
said digitally processed signals with short-delay filtering, said short-delay filtering being controlled by
predetermined linear-prediction coefficient (LPC) parameters; characterised in that said short-delay filtering uses a
pole-zero transfer function formed by the ratio of two all-pole transfer functions, the zeros of said pole-zero
transfer function having smaller radii than the corresponding poles.

2. An adaptive filtering method according to claim 1, wherein said short-delay filtering is carried out in cascade with
long-delay filtering controlled by predetermined pitch and pitch-predictor parameters.

3. An adaptive filtering method according to claim 1 or 2, comprising first-order filtering with a transfer function
1 - μz^-1, μ < 1, in cascade with said short-delay filtering.

4. An adaptive filtering method according to claim 1 or 2, wherein said parameters are predetermined at a
transmitter-receiver by analysis of the digital speech or audio signals before processing for transmission to said
receiver, and said parameters are transmitted as side information to said receiver together with said digitally
processed speech or audio signals.

5. An adaptive filtering method according to claim 1 or 2, wherein said parameters are predetermined at said receiver
by performing an analysis of received digitally processed speech or audio signals.

6. An adaptive filtering method according to claim 2, wherein said LPC parameters are predetermined at said
transmitter by analysis of digital speech or audio signals before processing and transmission as side information to
said receiver, and said pitch and pitch-predictor parameters are predetermined at said receiver by performing the
analysis of received digitally processed speech or audio signals.

7. An adaptive filtering method according to any one of claims 1 to 6, including automatic gain control of said
digitally processed signal after filtering, by computing a value σ2(n) proportional to the volume of filtered speech
or audio signals and a value σ1(n) proportional to the volume of speech or audio signals before filtering, and
controlling the gain of the filtered speech or audio signals by means of a ratio of σ1(n) to σ2(n).

8. A method according to claim 2, wherein postfiltering is carried out using a transfer function for said long-delay
postfilter of the form

$$ C_g \frac{1 + \gamma z^{-p}}{1 - \lambda z^{-p}} $$

C_g being an adaptive scaling factor and the factors γ and λ being determined according to the following formulas

$$ \gamma = C_z f(x), \quad \lambda = C_p f(x), \quad 0 < C_z, C_p < 1 $$

with

$$ f(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } U_{th} \le x \le 1 \\ 0 & \text{if } x < U_{th} \end{cases} $$

U_th being a threshold value and x being either b2 or b1+b2+b3 depending on whether a one-tap or a three-tap pitch
predictor is used.



Claims (German version)

1. An adaptive filtering method for enhancing digitally processed speech or audio signals at a receiver by filtering
the digitally processed signals by means of short-delay filtering, the short-delay filtering being controlled by
predetermined linear-prediction coefficient (LPC) parameters, characterised in that the short-delay filtering uses a
transfer function with poles and zeros which consists of the ratio of two transfer functions with poles only, the zeros
of the transfer function with poles and zeros having smaller radii than corresponding poles.

2. An adaptive filtering method according to claim 1, wherein the short-delay filtering is carried out in cascade with
long-delay filtering which is controlled by predetermined pitch and pitch-prediction parameters.

3. An adaptive filtering method according to claim 1 or 2, comprising first-order filtering with a transfer function
1 - μz^-1, μ < 1, in cascade with the short-delay filtering.

4. A method according to claim 1 or 2, wherein the parameters are predetermined at a transmitter by analysis of
digital speech or audio signals before processing for transmission to the receiver, and the parameters are
transmitted to the receiver as side information together with the digitally processed speech or audio signals.

5. A method according to claim 1 or 2, wherein the parameters are predetermined at the receiver by performing an
analysis of received digitally processed speech or audio signals.

6. A method according to claim 2, wherein the LPC parameters are predetermined at the transmitter by analysis of
digital speech or audio signals before processing and transmission as side information to the receiver, and the pitch
and pitch-prediction parameters are predetermined at the receiver by performing an analysis of received digitally
processed speech or audio signals.

7. A method according to any one of claims 1 to 6, comprising automatic gain control of the digitally processed
signal after filtering by computing a value σ2(n) which is proportional to the level of filtered speech or audio
signals and a value σ1(n) which is proportional to the level of speech or audio signals before filtering, and
controlling the gain of the filtered speech or audio signals with a ratio of σ1(n) to σ2(n).

8. A method according to claim 2, wherein postfiltering is achieved by using a transfer function for the long-delay
postfiltering which has the form

$$ C_g \frac{1 + \gamma z^{-p}}{1 - \lambda z^{-p}} $$

where C_g is an adaptive scaling factor and the factors γ and λ are determined according to the following formulas

$$ \gamma = C_z f(x), \quad \lambda = C_p f(x), \quad 0 < C_z, C_p < 1 $$

where

$$ f(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } U_{th} \le x \le 1 \\ 0 & \text{if } x < U_{th} \end{cases} $$

and U_th is a threshold value and x can be either b2 or b1+b2+b3, depending on whether a pitch predictor with one tap
or with three taps is used.